Screening properties of trend tests in genetic association studies

In genome-wide association study, extracting disease-associated genetic variants among millions of single nucleotide polymorphisms is of great importance. When the response is a binary variable, the Cochran-Armitage trend tests and associated MAX test are among the most widely used methods for association analysis. However, the theoretical guarantees for applying these methods to variable screening have not been built. To fill this gap, we propose screening procedures based on adjusted versions of these methods and prove their sure screening properties and ranking consistency properties. Extensive simulations are conducted to compare the performances of different screening procedures and demonstrate the robustness and efficiency of MAX test-based screening procedure. A case study on a dataset of type 1 diabetes further verifies their effectiveness.

With the development of high throughput sequencing techniques, hundreds of thousands of single nucleotide polymorphisms (SNPs) in the genome are recorded, which enables researchers to investigate and treat diseases from the perspective of genetic variants. To identify the disease-related genes or genetic markers among all these SNPs, genome-wide association study (GWAS) is a widely used strategy. Up to now, more than one hundred thousands of SNPs have been identified to be related to many traits [1][2][3][4][5][6][7] .
The commonly used GWAS tests the association between the phenotype and each SNP sequentially, obtains a series of test statistics or p-values, and selects the associated SNPs by comparing these statistics or p-values with a given threshold. When the phenotype is binary, Cochran-Armitage trend test (CATT) 8 is always used to detect the associated SNPs. It has been shown that when the underlying genetic model is known, where the commonly used ones are recessive, additive or dominant models, CATT has an optimal form 9,10 . However, the true genetic models are always unknown and may be very complicated. For the sake of robustness, an omnibus test called MAX is proposed 11,12 , which uses the maximum of CATTs under different genetic models as a measure for association. The asymptotical distribution of MAX is given in the work of Zheng et al. 13 . Since its being raised, MAX has been widely used and investigated. Li et al. 14 introduced a selection procedure based on the rank of MAX. Kim et al. 15 proposed a SNP selection method based on MAX and a penalized support vector machine strategy.
Though CATTs and MAX have concise forms and are extensively used, theoretical properties for the applications of CATTs and MAX to GWAS have not been investigated. To control false discovery rate (FDR) in GWAS, Bonferroni correction strategy and FDR control procedures, such as Benjamini-Hochberg procedure, are two widely used strategies. But they both assume that all the SNPs are independent, which certainly is improperly since linkage disequilibrium usually exists among SNPs and may lead to omission on related SNPs. Considering these drawbacks, feature screening methods are sensible alternatives. Rather than select the associated SNPs directly, feature screening approaches aim to eliminate most of the irrelevant SNPs at first. After a screening procedure, there remains only a small amount of SNPs and researchers can concentrate on these remaining SNPs, which can save much time and work.
In the last few years, feature screening methods have been proposed for various situations. Fan and Lv 16 first proposed a screening method called the sure independence screening approach for Gaussian response and predictors under linear regressions. Since then, sure screening property, which retains all the important predictors with high probability as the sample size goes into infinity, has been regarded as a feature screening criterion. Many screening procedures have been developed for diverse models, such as the generalized linear model 17 and additive model 18 among others. Although many procedures can be directly applied to GWAS with corresponding models and data types, only PC-SIS, proposed in the work of Huang et al. 19 , is applicable to the considered situation where both the outcome and predictors are categorical. However, PC-SIS does not take the www.nature.com/scientificreports/ information on genetic model into consideration. Just as mentioned above, CATTs and MAX test consider this information in the association analysis. But their screening properties have not been studied yet. To fill this gap, we propose feature screening methods based on CATTs in different genetic models and MAX test, and investigate their sure screening and rank consistency properties. The rest of paper is organised as follows. In "Trend test", we briefly describe the trend tests which can be used to evaluate the relationship between a binary variable and a genotype variable. "Independence screening procedure" introduces the independence screening procedures based on the adjusted trend test statistics, and presents sure screening and ranking consistency properties. Simulation studies are conducted in "Simulation studies" . And a case study on type 1 diabetes is demonstrated in "Application to a real dataset". A conclusion for this work is presented in "Conclusion". All proofs of theorems are provided in the Supplemental Materials.

Trend test
CATT evaluates the association between a binary variable and a SNP, and is widely used in case-control genetic data analysis. Compared with Pearson chi-square test, it makes use of the underlying genetic model. Its specific form is as follows. Suppose r cases and s controls are enrolled in the study. For a given SNP, the genotypes can be expressed as aa, Aa and AA, respectively, with A being a high risk candidate allele. In the sample of cases, the counts of aa, Aa and AA are r 0 , r 1 and r 2 , respectively. And the corresponding counts in the control samples are s 0 , s 1 and s 2 . Thus we have r = r 0 + r 1 + r 2 , s = s 0 + s 1 + s 2 . Denote n = r + s and n i = r i + s i for i = 0, 1, 2 . All these counts are displayed in Table 1. Then CATT can be written as where (X 0 , X 1 , X 2 ) is a pre-defined genotype score vector. Note that the optimal score vector for CATT varies across different genetic models. Specifically, for the commonly encountered recessive genetic model, additive genetic model and dominant genetic model, the optimal genotype score vectors are (0, 0, 1), (0, 1 2 , 1) and (0, 1, 1), respectively. And the respective corresponding CATT can be denoted as Z 0 , Z 1 2 and Z 1 . Under the null hypothesis of no association, these three CATTs above are asymptotically normally distributed as N(0, 1).
However, in practice, the true genetic model is unknown. Thus none of Z 0 , Z 1 2 and Z 1 is robust in all situations. To tackle this issue, the statistic MAX is proposed as By using the maximum of absolute values of Z 0 , Z 1 2 and Z 1 , Z max obtains robustness under diverse situations.

Independence screening procedure
Screening procedure. CATTs and MAX test are designed for testing the relationship between a binary response and a SNP variable. We apply them to feature screening task and display their properties.
Suppose G = (G 1 , G 2 , . . . , G m ) ⊤ is a m-dimensional SNP vector and Y is a binary response which is 1 for a case sample and 0 for a control sample. Denote P(Y = 1) = p and P(Y = 0) = q , where p + q = 1. Our aim is to identify the SNPs among all the m SNPs that are related with Y. In accordance with practice, each SNP takes value in {0, 1, 2} , corresponding to genotypes aa, Aa and AA, respectively.
Given the above notations, the empirical estimators of ω 0,k , ω 1 2 ,k , ω 1,k , and ν k for k ∈ {1, 2, . . . , m} are and where p ik ,q ik ,p,q,f ik are the empirical estimators of p ik , q ik , p, q, f ik , and can be estimated as Plug them into the expression, ω j,k has the form Note that ω j,k = Z j,k √ n , where Z 0,k , Z 1 2 ,k and Z 1,k are CATT statistics between G k and Y for the pre-defined score vector (X 0 , X 1 , X 2 ) being (0, 0, 1), (0, 1 2 , 1) and (0, 1, 1), respectively. So ω j,k is an adjusted version of Z j,k , whose value range is not effected by sample size. And ν k maintains the ranking result of Z max,k for each predictor. Large values of ν k indicate the existence of association between G k and Y. We denote ω j,k as aCATT and ν k as aMAX.
Assume that only a small part of SNPs are related with the response Y. We use aCATT |ω j,k | s and aMAX ν k s to identify their positions. The screening procedures based on |ω 0,k | s, |ω Screening properties. We call a SNP as an active SNP if it is associated with the response Y. Define different index sets of active SNPs based on different measures by www.nature.com/scientificreports/ Their estimated truncated active index sets can be expressed as where c > 0 and τ > 0 are two pre-specified constants and satisfy some certain conditions. Now we investigate the theoretical properties of the screening procedures of Â * j and Â * s. First list some conditions. Condition 1 (C1) There exists constants 0 < ζ min ≤ ζ max < 1 such that for i = 0, 1, 2 and k = 1, 2, . . . , m , if Then we present the sure screening properties based on aCATT and aMAX in Theorem 1 and 2, whose proofs are shown in Supplemental Materials.  www.nature.com/scientificreports/ In practice, c and τ are hard to be determined to satisfy the condition that the estimated truncated active index sets contain the corresponding active index sets. So it is common to select SNPs corresponding to the first d largest statistic values as related SNPs, where d is a pre-defined constant. That is, the respective estimated active index sets have the following forms and We now explain why we determine the index sets corresponding to the first d largest statistics as active index sets. Take MAX-SIS for example. Given c and τ , the cardinality of Â * is determined, which is denoted as d 0 . According to Theorem

Simulation studies
In this section, we conduct simulation studies to assess the performances of REC-SIS, ADD-SIS, DOM-SIS and MAX-SIS by comparing with PC-SIS 19 .
For each genetic model, the dimension of SNPs is m = 10 5 . Since the sample size, the case-to-control ratio and the minor allelic frequency (MAF) 20 can affect the association analysis in a case-control study, we consider different settings on them. To be specific, we choose the sample size n from {1500, 3000, 4500} , the caseto-control ratio w = p : q from {1, 1/3, 1/5} and MAF α from {0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45} . Because only the counts of genotypes are needed to calculate the statistics of interest, there is no need to generate original samples {(g l , y l ), l = 1, 2, · · · , n} in the simulation studies. Instead, we can just generate the count data from the trinomial distribution for each dataset. For the kth genetic variant (SNP), the count vector of three genotypes for case samples (r 0k , r 1k , r 2k ) follows the trinomial distribution Mul(np, p 0k /p, p 1k /p, p 2k /p) and that for control samples (s 0k , s 1k , s 2k ) follows the trinomial distribution Mul(nq, q 0k /q, q 1k /q, q 2k /q) , where p 0k + p 1k + p 2k = p, q 0k + q 1k + q 2k = q.
In each dataset, the first six SNPs are set to be related with Y and the rest SNPs are independent of Y. For the control sample, the count vector of each SNP G k (k ∈ {1, 2, . . . , 10 5 }) (s 0k , s 1k , s 2k ) is generated from the trinomial distribution Mul(nq, q 0k /q, q 1k /q, q 2k /q) , where q 0k = q(1 − α) 2 , q 1k = 2qα(1 − α), q 2k = qα 2 with α being the MAF. For the case sample, the count vector of each irrelevant SNP G k , (k ∈ {7, 8, . . . , 10 5 }) (r 0k , r 1k , r 2k ) is generated from Mul(np, p 0k /p, p 1k /p, p 2k /p) with p ik /p = q ik /q, i = 0, 1, 2 ; while the count vector for each relevant SNP G k (k ∈ {1, 2, . . . , 6}) (r 0k , r 1k , r 2k ) is generated from the trinomial distribution Mul(np, p 0k /p, p 1k /p, p 2k /p) , where (p 0k , p 1k , p 2k ) are functions of (q 0k , q 1k , q 2k ) and are diverse for different genetic models. Four different genetic models are considered, that is, recessive genetic model, additive genetic model, dominant genetic model and mixture of them, which are denoted as Model I, Model II, Model III and Model IV as follows, respectively. Under each genetic model, 500 repetitions are conducted to compare the performances of different methods. We employ two criteria to measure the effectiveness of each screening approach. One is the proportion for each relevant SNP G k , k ∈ A that is selected among all the 500 repetitions and is denoted as P k s . The other is the proportion that all the relevant SNPs are simultaneous selected among these 500 repetitions, which is denoted as P a .
Model I. Data are generated from the recessive genetic model. For the relevant SNPs G k , (k = 1, 2, . . . , 6) , q 0k +q 1k + q 2k , with = 1.8. Model II. Data are generated from the additive genetic model. For the relevant SNPs G k , (k = 1, 2, . . . , 6) , Data are generated from the dominant genetic model. For the relevant SNPs G k , (k = 1, 2, . . . , 6) , p 0k = pq 0k q 0k + q 1k + q 2k , p 1k = p q 1k q 0k + q 1k + q 2k , p 2k = p q 2k q 0k + q 1k + q 2k , with = 1.6. Model IV. Data are generated from the mixture of three genetic models. Relevant SNPs G 1 and G 2 are generated as those in Model I, relevant SNPs G 3 and G 4 are generated as those in Model II and relevant SNPs G 5 and G 6 are generated as those in Model III.
For each model, the proportions P k s , k = 1, 2, . . . , 6 and P a are calculated with the constant d = [n/ log n] , where [a] denotes the integer part of a. The results are plotted in Figs. 1, 2, 3 and 5. Since in Models I, II and III, the first six relevant SNPs are generated from the same distribution, P k s , k = 1, 2, . . . 6 are similar in these models. Therefore, we only plot the results for P 1 s in Figs. 1, 2 and 3. In Model IV, the relevant SNPs are generated from different genetic models, so the results for P k s , k = 1, 2, . . . 6 are plotted in Fig. 3. Besides, the results for P a are all plotted in Figs. 1, 2, 3 and 5.
Results in Fig. 1 correspond to the recessive genetic model. It can be seen that REC-SIS performs the best, MAX-SIS comes the second, and DOM-SIS is the worst. As Fig. 1 illuminates,  www.nature.com/scientificreports/ the screening approaches increases as sample size, the case-to-control ratio and MAF increase. In addition, it shows that PC-SIS almost fails to detect the relevant SNPs when MAF is less than 0.3. The simulation results for Model II are presented in Fig. 2. It shows that when the underlying genetic model is exactly additive genetic model, ADD-SIS performs best, MAX-SIS ranks the second. As Fig. 2 displays, the ability to detect relevant SNPs for all the screening approaches increases as the sample size and the case-to-control ratio increase. REC-SIS has low powers when MAF is small. The detection proportions of DOM-SIS first increase and then decrease slightly as MAF increases. In general, the detection proportions of MAX-SIS and ADD-SIS increase as MAF becomes larger. Whereas, the detection proportions of PC-SIS first increase slightly and then decrease dramatically as MAF increases.
The results for Model III are exhibited in Fig. 3. It shows that when the underlying genetic model is exactly dominant genetic model, DOM-SIS performs the best and REC-SIS can hardly work. As shown in Fig. 3 , the ability of detecting G 1 for all the screening approaches increases as the sample size and the case-to-control ratio increase. Furthermore, when MAF is greater than 0.25, the detection proportions for all the methods except REC-SIS decline as MAF increases. From the right subplot of Fig. 3, we can see that the performances of ADD-SIS and PC-SIS are greatly influenced by MAF, while those of DOM-SIS and MAX-SIS are robust against MAF.
As for Model IV, since the effects of sample size and case-to-control ratio on the performances of different methods have been demonstrated in the above three models, we take n = 3000, w = 0.2 as representative to demonstrate the effects of different genetic models. The results of P k s , k = 1, 2, . . . , 6 when n = 3000, w = 0.2 are illustrated in the left subplot of Fig. 4 and the results of P a under all scenarios are shown in the right subplot  From the simulation results above, we can see that sample size and case-to-control ratio are two important factors that affect the association analysis. It is rational that increasement in sample size can enhance the efficiency in identifying associated SNPs. As for case-to-control ratio, when the ratio approaches 1, all the methods have better performances than conditions with larger ratios. Given the size for case sample, increasing the size for control sample has little contribution on the performances of all the methods. For example, when w = p : q = 1/3, n = 3000 and w = p : q = 1 : 5, n = 4500 , that is when the case sample size r = 750 , and the control sample size s = 2250 and 3750 respectively, the selection proportions of all the five screening methods have similar results no matter how MAF varies. The effect of MAF is not monotonic. In recessive model, the selection proportions of all the five methods increase as MAF increases. However, in other models, the selection proportions of some methods first increase and later decrease as MAF increases. Under all the scenarios considered, MAX-SIS is the most robust method among these five screening methods.  The left subplot is for P k s , k = 1, 2, . . . , 6 among 500 repetitions, when sample size is 3000 and case-to-control ratio is 0.2. The right subplot is for P a among 500 repetitions. www.nature.com/scientificreports/ Overall, we can come to the conclusion that if all the candidate SNPs follow the same known genetic model, one of REC-SIS, ADD-SIS, DOM-SIS performs the best. However, the genetic model is always complicated and unknown in practice. In this case, MAX-SIS is recommended to reach robustness and efficiency.

Application to a real dataset
We apply the proposed screening procedures to a real case-control data of type 1 diabetes for British people 1 . The data contains 459,446 SNPs for 2938 controls and 1963 cases. Since there exist some missing values in the genotype data, the number of observed genotypes for a single SNP varies across all the SNPs. Count the number of missing values for each partially observed SNP. And it shows that the average number and the largest number of all these counts is 16.72 and 503, respectively, and the 25%, 50%, 75% quantile of these counts are 4, 7 and 13, respectively. To make aCATT and aMAX statistics have similar consistency rates for all the SNPs, SNPs with missing ratio large than 1% are deleted. Besides, SNPs with only two genotypes being observed are also removed from the dataset, resulting in 352,659 SNPs to be analyzed. For each SNP, the allele with lower frequency is treated as the risk allele. We use REC-SIS, ADD-SIS, DOM-SIS, MAX-SIS and PC-SIS to screen out the redundant SNPs, with the parameter d being [4901/ log(4901)] = 576. The results are shown in the venn diagram in Fig. 5 to display the screening results of all the five procedures. It shows that 242 SNPs are selected by all the procedures. Among these SNPs, SNPs rs9272346 and rs9272346 have been reported to be associated with type 1 diabetes 1 . This indicates that there may be some important association information contained in these SNPs which need to be further investigated. We list these 242 SNPs in Table 3.

Conclusion
Screening SNPs in case-control study is a commonly encountered task in modern biomedical research. And CATT and MAX statistics are the most widely used screening measures for this issue. However, the theoretical guarantees for the application of CATT and MAX to SNP screening have not been investigated. We fill this gap by adjusting CATTs and MAX test, and proposing screening procedures based on the adjusted statistics. Sure screening properties and ranking consistency properties of these screening procedures are proved. Simulation results show that when the underlying genetic model is unknown, which is often the case in practice, MAX-SIS performs the best.
Despite of the high efficiency of the proposed procedures, there exist some factors that affect their performances. First, numerical simulations show that when both MAF and sample size are small, REC-SIS, ADD-SIS, DOM-SIS, MAX-SIS and PC-SIS all perform badly. This is because that under this situation, the number of samples possessed with minor alleles is too small to provide enough information for the association analysis. Second, it is obviously that the value of the parameter d influence the performances of different methods. We determine the value of d based on works in the previous literatures. Since how to choose an optimal d is not the focus of this work, we will conduct more detailed analysis further. Third, when there exist covariates to be adjusted for, new procedures need to be developed, which will be studied in a future work. www.nature.com/scientificreports/

Data availibility
All data included in this study are available upon request by contacting with the corresponding author. To facilitate the usage for the proposed methods, the codes are available upon request by contacting with the corresponding author.